
Windows Azure : Exploring Full-Text Search (part 2) - Building an FTS Engine on Azure

10/22/2010 6:08:09 PM

3. Building an FTS Engine on Azure

That was quite a bit of theory to set up what you will do next: build your own FTS engine on Windows Azure storage.

3.1. Picking a data source

The first thing you need is some data to index and search. Feel free to use any data you have lying around. The code you are about to see should work on any set of text files.

To find a good source of sample data, let’s turn to one that is easily available and widely used: Project Gutenberg. This project provides thousands of free books online in several accessible formats under a very liberal license. You can download your own copies from http://www.gutenberg.org. If you’re feeling lazy, you can download the exact Gutenberg book files used here from http://www.sriramkrishnan.com/windowsazurebook/gutenberg.zip.

Why use plain-text files and not some structured data? There is no reason, really. You can easily modify the code samples you’re about to see and import some structured data, or data from a custom data source.

3.2. Setting up the project

To keep this sample as simple as possible, let’s build a basic console application. This console application will perform only two tasks. First, when pointed to a set of files in a directory, it will index them and create the inverted index in Windows Azure storage. Second, when given a search query, it will search the index in Azure storage. Sounds simple, right?

  1. Create a .NET 3.5 Console Application project using Visual Studio. In this sample, call the project FTS, which makes the project’s namespace FTS by default. If you’re calling your project by a different name, remember to fix the namespace.

  2. Add references to the assemblies System.Data.Services.dll and System.Data.Services.Client.dll. This brings in the assemblies you need for ADO.NET Data Services support.

  3. Bring in the Microsoft.WindowsAzure.StorageClient library to talk to Azure storage.

  4. Set up the configuration file with the right account name and shared key by adding a new App.config to the project and entering the following contents. Remember to fill in your own account name and key:

    <?xml version="1.0" encoding="utf-8" ?>
    <configuration>
      <appSettings>
        <add key="DataConnectionString"
             value="AccountName=YourAccountName;AccountKey=YourAccountKey==;DefaultEndpointsProtocol=https" />
      </appSettings>
      <system.net>
        <settings>
          <servicePointManager expect100Continue="false" useNagleAlgorithm="false" />
        </settings>
      </system.net>
    </configuration>
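
To make the shape of the DataConnectionString value concrete, here is a minimal sketch that splits such a string into its settings. The class and method names (ConnectionStringSketch, Parse) are hypothetical and exist only for illustration. This is not how CloudStorageAccount.Parse is implemented. It only shows that the value is a semicolon-separated list of name=value pairs, and that a key ending in == survives because each pair is split at its first equals sign only.

```csharp
using System;
using System.Collections.Generic;

namespace ConnectionStringSketchDemo
{
    // Hypothetical helper, for illustration only.
    public static class ConnectionStringSketch
    {
        // Splits "Name1=Value1;Name2=Value2" into a dictionary.
        // Each pair is split at its FIRST '=' so that Base64 padding
        // ("==") at the end of an account key is preserved.
        public static Dictionary<string, string> Parse(string connectionString)
        {
            var settings = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
            var parts = connectionString.Split(
                new[] { ';' }, StringSplitOptions.RemoveEmptyEntries);

            foreach (var part in parts)
            {
                var eq = part.IndexOf('=');
                if (eq <= 0)
                {
                    continue; // skip fragments without a name
                }
                settings[part.Substring(0, eq).Trim()] = part.Substring(eq + 1).Trim();
            }
            return settings;
        }
    }
}
```

Calling ConnectionStringSketch.Parse on the placeholder value from the configuration file yields three entries: AccountName, AccountKey, and DefaultEndpointsProtocol.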


3.3. Modeling the data

As you learned earlier, you must create two key data structures. The first is a mapping between document IDs and documents. You will be storing that in a table in Azure storage. To do that, you use the following wrapper class inherited from TableServiceEntity. Add the following code as Document.cs to your project:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Data.Services;
using System.Data.Services.Client;
using Microsoft.WindowsAzure.StorageClient;

namespace FTS
{
    public class Document : TableServiceEntity
    {
        // The document ID serves as both the partition key and the row key
        public Document(string title, string id) : base(id, id)
        {
            this.Title = title;
            this.ID = id;
        }

        public Document() : base()
        {
            // Empty constructor for ADO.NET Data Services
        }

        public string Title { get; set; }
        public string ID { get; set; }
    }
}

This class wraps around an “entity” (row) in a Document table. Every entity has a unique ID, and a title that corresponds to the title of the book you are storing. In this case, you are going to show only the title in the results, so you’ll be storing only the title in Azure storage. If you wanted, you could choose to store the contents of the books themselves, which would let you show book snippets in the results. You use the document ID as the partition key, which will place every document in a separate partition. This provides optimum performance because you can always specify the exact partition you want to access when you write your queries.

The second key data structure you need is an inverted index. As discussed earlier, an inverted index stores a mapping between index terms and documents. To make this indexing easier, you use a small variant of the design you saw in Table 11-2.

In that table, you saw every index term being unique and mapping to a list of document IDs. Here, you have a different table entry for every index term-document ID pair. This provides a lot of flexibility. For example, if you move to a parallel indexing model, you can add term-to-document ID mappings without worrying about trampling over a concurrent update.
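
The one-row-per-pair design can be sketched in memory before any storage code is involved. The following is a minimal illustration, not the indexing code of this sample: the names TermDocPair and PairSketch.PairsFor are hypothetical, and the tokenization rule (lowercase runs of letters and digits) is an assumption made only to keep the example short.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

namespace PairSketchDemo
{
    // Hypothetical in-memory stand-in for one row of the index table.
    public class TermDocPair
    {
        public string Term { get; set; }   // would become the partition key
        public string DocID { get; set; }  // would become the row key
    }

    public static class PairSketch
    {
        // Emits one (term, document ID) pair per distinct term in the text.
        // Assumption: a term is a lowercase run of letters and digits.
        public static List<TermDocPair> PairsFor(string docID, string text)
        {
            return Regex.Matches(text.ToLowerInvariant(), @"[\p{L}\p{Nd}]+")
                        .Cast<Match>()
                        .Select(m => m.Value)
                        .Distinct()
                        .Select(t => new TermDocPair { Term = t, DocID = docID })
                        .ToList();
        }
    }
}
```

Because every pair is an independent row, two indexers working on different documents never write to the same row, even when both documents contain the same term.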

Save the following code as IndexEntry.cs and add it to your project:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Data.Services;
using System.Data.Services.Client;
using Microsoft.WindowsAzure.StorageClient;

namespace FTS
{
    public class IndexEntry : TableServiceEntity
    {
        public IndexEntry(string term, string docID)
            : base(term, docID)
        {
            this.Term = term;
            this.DocID = docID;
        }

        public IndexEntry()
            : base()
        {
            // Empty constructor for ADO.NET Data Services
        }

        public string Term { get; set; }
        public string DocID { get; set; }
    }
}


At this point, you might be wondering how this design lets you quickly get the list of documents in which a term appears. Note that all entries with the same term go into the same partition, because the code uses the term as the partition key. To get the list of documents in which a term appears, you just query for all entities within that term’s partition.

This is easier to understand with the help of a picture. Figure 1 shows the index table containing the mappings for two terms, foo and bar. Since each term gets its own partition, the index table has two partitions. Each partition has several entries, each corresponding to a document in which the term appears.

Figure 1. Index table with two partitions
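
The lookup that Figure 1 describes can be sketched with a plain list standing in for the index table. Everything here is illustrative: IndexRow, SampleRows, and DocumentsFor are hypothetical names, and the document IDs are made up. A query against Azure table storage would express the same filter on PartitionKey.

```csharp
using System.Collections.Generic;
using System.Linq;

namespace PartitionLookupDemo
{
    // Hypothetical stand-in for an index table entity.
    public class IndexRow
    {
        public string PartitionKey { get; set; } // the term
        public string RowKey { get; set; }       // the document ID
    }

    public static class PartitionLookupSketch
    {
        // Two partitions, "foo" and "bar", with made-up document IDs.
        public static List<IndexRow> SampleRows()
        {
            return new List<IndexRow>
            {
                new IndexRow { PartitionKey = "foo", RowKey = "doc1" },
                new IndexRow { PartitionKey = "foo", RowKey = "doc3" },
                new IndexRow { PartitionKey = "bar", RowKey = "doc2" },
                new IndexRow { PartitionKey = "bar", RowKey = "doc3" }
            };
        }

        // Returns the IDs of all documents in the partition for the given term.
        public static List<string> DocumentsFor(IEnumerable<IndexRow> rows, string term)
        {
            return rows.Where(r => r.PartitionKey == term)
                       .Select(r => r.RowKey)
                       .ToList();
        }
    }
}
```

With the sample rows, DocumentsFor(SampleRows(), "foo") returns doc1 and doc3, and a term with no partition returns an empty list.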


The last piece of the data model is a context class, FTSDataServiceContext, derived from TableServiceContext. It wraps the two classes you just wrote and enables you to query them through ADO.NET Data Services:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;
using System.Data.Services.Client;

namespace FTS
{
    public class FTSDataServiceContext : TableServiceContext
    {
        public FTSDataServiceContext(string baseAddress,
            StorageCredentials credentials)
            : base(baseAddress, credentials)
        {
        }

        public const string DocumentTableName = "DocumentTable";

        public IQueryable<Document> DocumentTable
        {
            get
            {
                return this.CreateQuery<Document>(DocumentTableName);
            }
        }

        public const string IndexTableName = "IndexTable";

        public IQueryable<IndexEntry> IndexTable
        {
            get
            {
                return this.CreateQuery<IndexEntry>(IndexTableName);
            }
        }
    }
}


3.4. Adding a mini console

The following trivial helper code enables you to test out various text files and search for various terms. Replace your Program.cs with the following code. This essentially lets you call out to an Index method or a Search method based on whether you enter index or search in the console. You’ll be writing both very soon, so let’s just leave stub implementations for now:

using System;
using System.Collections.Generic;
using System.Configuration;
using System.Linq;
using System.Text;
using System.IO;
using Microsoft.WindowsAzure.StorageClient;
using Microsoft.WindowsAzure;

namespace FTS
{
    class Program
    {
        static void Main(string[] args)
        {
            CreateTables();
            Console.WriteLine("Enter command - 'index <directory-path>' " +
                "or 'search <query>' or 'quit'");
            while (true)
            {
                Console.Write(">");
                var command = Console.ReadLine();
                if (command == null)
                {
                    // End of input, for example when stdin is redirected
                    return;
                }

                if (command.StartsWith("index"))
                {
                    // Everything after the command word is the directory path
                    var path = command.Substring("index".Length).Trim();
                    Index(path);
                }
                else if (command.StartsWith("search"))
                {
                    // Everything after the command word is the query
                    var query = command.Substring("search".Length).Trim();
                    Search(query);
                }
                else if (command.StartsWith("quit"))
                {
                    return;
                }
                else
                {
                    Console.WriteLine("Unknown command");
                }
            }
        }

        // Stub implementations for now
        static void Index(string path) { }

        static void Search(string query) { }
    }
}
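
The command handling in the console loop amounts to splitting each input line into a command word and an argument. A minimal sketch of that split follows. CommandSketch.Parse is a hypothetical helper that is not part of the sample. It assumes the command word ends at the first space and that the rest of the line, trimmed, is the argument.

```csharp
using System;

namespace CommandSketchDemo
{
    // Hypothetical helper, for illustration only.
    public static class CommandSketch
    {
        // Returns a two-element array: { command word, argument }.
        // The argument is empty when the line holds only a command word.
        public static string[] Parse(string line)
        {
            var trimmed = (line ?? string.Empty).Trim();
            var space = trimmed.IndexOf(' ');
            if (space < 0)
            {
                return new[] { trimmed.ToLowerInvariant(), string.Empty };
            }
            return new[]
            {
                trimmed.Substring(0, space).ToLowerInvariant(),
                trimmed.Substring(space + 1).Trim()
            };
        }
    }
}
```

Splitting at the first space, rather than at a fixed offset, means that a bare command such as search with no query produces an empty argument instead of an exception, and multiword queries are kept whole.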


3.5. Creating the tables

At the top of Main, you see a CreateTables method call. As the name implies, this creates your tables in Azure table storage if they don’t already exist. To do that, add the following code below Main in Program.cs:

static void CreateTables()
{
    // Fully qualified so that no extra using directive is needed
    var account = CloudStorageAccount.Parse(
        System.Configuration.ConfigurationSettings.AppSettings["DataConnectionString"]);

    // One table client is enough for both calls
    var tableClient = account.CreateCloudTableClient();
    tableClient.CreateTableIfNotExist(FTSDataServiceContext.IndexTableName);
    tableClient.CreateTableIfNotExist(FTSDataServiceContext.DocumentTableName);
}

